Exivity Kubernetes best practices

This document describes the recommended Kubernetes deployment patterns for Exivity in self-managed, on-premises environments. It is intended as a prescriptive starting point for implementation teams that need a default architecture for single-node, multi-node, and multi-site deployments.

The recommendations below assume a Linux Kubernetes cluster, Helm-based deployment using the Exivity chart, and self-managed infrastructure components such as ingress, storage, PostgreSQL, and RabbitMQ.

Third-party middleware

Running Exivity on Kubernetes depends on third-party infrastructure and middleware: Kubernetes itself, PostgreSQL, RabbitMQ, an ingress controller such as NGINX Ingress Controller or Traefik, and a storage platform such as Longhorn or NFS-backed storage.

These products are third-party infrastructure that you operate. Exivity documents how the application depends on them, but you are responsible for selecting, operating, securing, backing up, monitoring, and supporting those third-party products according to their vendor documentation and your internal platform standards.

Deployment scenarios

Scenario	Recommended use	Preferred architecture
☸️ Single-node Kubernetes	Small environments, evaluation, non-HA production where simplicity is preferred	One Kubernetes node, ingress/TLS, embedded or external PostgreSQL, embedded RabbitMQ, provisioner-backed local RWO storage (or RWX if already available)
☸️ Multi-node Kubernetes	Production HA deployments within one site	Multi-node Kubernetes, Longhorn RWX storage, external PostgreSQL, site-local in-cluster RabbitMQ, ingress/load balancer
☸️ Multi-site Kubernetes	Disaster recovery across sites	Active/passive sites with replicated PostgreSQL, independent RabbitMQ per site, independent storage per site, GitOps-controlled failover

Common foundations

Use these foundations for all Kubernetes scenarios.

Area	Recommendation
☸️ Kubernetes	Use a CNCF-conformant Kubernetes distribution on Linux nodes with a stable CSI driver and production ingress controller. Known Exivity deployments run on upstream Kubernetes, Rancher (RKE2/K3s), and Red Hat OpenShift. Other CNCF-conformant distributions are likely to work but should be confirmed with Exivity support before production use. Lightweight learning distributions such as Minikube, Kind, and Docker Desktop are intended for development only.
⎈ Helm	Use Helm 3 and maintain deployment values in version control.
🗂️ Namespace	Deploy Exivity into a dedicated namespace, normally `exivity`.
🚦 Ingress / load balancer	Use a production ingress controller such as NGINX or Traefik. Terminate TLS at ingress or at an upstream load balancer.
🛡️ TLS	Use cert-manager, enterprise PKI, or an existing TLS secret. Do not expose production Exivity over plain HTTP.
🔐 Secrets	Set `secret.appKey` and `secret.jwtSecret` explicitly for production. Do not rely on generated values.
📦 Image registry	Mirror images to an internal registry for restricted or air-gapped sites.
🔄 Backups	Back up PostgreSQL and Exivity shared data. Do not rely only on persistent volumes for recovery.
📈 Monitoring	Enable Kubernetes, ingress, PostgreSQL, RabbitMQ, and storage monitoring. Enable the Exivity ServiceMonitor where Prometheus Operator is used.
📄 Logs	Tune log retention with `logfiles.deleteDays` and `logfiles.compressDays` to match your retention and storage requirements.
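
The secrets and log-retention rows above map onto a small Helm values fragment. The sketch below assumes the dotted keys named in the table (`secret.appKey`, `secret.jwtSecret`, `logfiles.deleteDays`, `logfiles.compressDays`) nest in the usual Helm way. The file name, placeholder strings, and day counts are illustrative assumptions, not chart defaults.

```yaml
# values-production.yaml (hypothetical file name)
secret:
  # Placeholders only. Supply real values from your secret management
  # process instead of committing them in plain text.
  appKey: "<your-app-key>"
  jwtSecret: "<your-jwt-secret>"

logfiles:
  # Illustrative numbers; choose them to match your retention policy
  # and the size of your log PVCs.
  compressDays: 7
  deleteDays: 30
```

Setting both secrets explicitly keeps them stable across upgrades and reinstalls, which generated values may not be.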

PVC sizing

The chart defaults are intentionally small and are usually not appropriate for production. Use the following as starting values and size upward for high-volume or multi-tenant environments.

PVC group	Volume	Recommended size
📚 Data	`extracted`	50-100Gi
📚 Data	`exported`	50-100Gi
📚 Data	`import`	10-20Gi
📚 Data	`report`	10-20Gi
📄 Logs	All service log PVCs	5-10Gi each
⚙️ Config	`etl`, `griffon`, `chronos`	1Gi
🐘 PostgreSQL	Embedded PostgreSQL or CloudNativePG instance volume	25-50Gi

For CSI-backed storage such as Longhorn, PVC sizes are enforced by the storage backend. For NFS-subdir provisioners, PVC sizes may be advisory only, but they should still be set to document intent and simplify future migration to CSI-backed storage. For local-path style provisioners on single-node deployments, PVC sizes are advisory and not reserved against node disk capacity, so always validate the sum of requested PVC sizes against the actual node disk size and monitor node disk free space.

`extracted` and `exported` typically grow fastest because they depend on data source volume, retention, and report frequency. Prefer 100Gi each for larger environments or high-frequency reporting.
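
As a rough worked example, take the upper end of each range in the table above. It assumes the 1Gi config size applies to each of the three config volumes, and it leaves the number of service log PVCs as $n$ because this document does not state it. The aggregate request is then approximately:

```latex
\underbrace{100 + 100 + 20 + 20}_{\text{data}}
+ \underbrace{3 \times 1}_{\text{config}}
+ \underbrace{50}_{\text{PostgreSQL}}
+ \underbrace{10\,n}_{\text{logs}}
= 293 + 10\,n \ \text{Gi}
```

On a single node with a local-path style provisioner, compare this total against the actual node disk size, because the provisioner will not do so.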

Scenario A: single-node Kubernetes

Single-node Kubernetes is suitable when HA is not required or when you want the smallest possible Kubernetes footprint. It is also useful for demos, evaluation, and small production environments with clear recovery expectations.

Architecture

This table describes the role of each layer in the diagram above. It is descriptive, not prescriptive.

Layer	Role in this scenario
👥 Users / API clients	Reach Exivity through DNS and the cluster ingress endpoint.
🛡️ Ingress / TLS	Routes `/` to `glass` and API paths to `proximity-api`; terminates TLS.
☸️ Kubernetes node	Single node hosting the control-plane, worker role, all Exivity services, RabbitMQ, PostgreSQL, and shared storage.
🐘 PostgreSQL	Either embedded (in-cluster) or external; both stay in the same single-node footprint.
🐇 RabbitMQ	Embedded in-cluster RabbitMQ used for transient communication.
💾 Shared volumes	Hold logs, config, and pipeline data; mounted into the Exivity services running on this node.

Configuration

This table lists the choices to make for a single-node deployment. It is prescriptive.

Decision	Recommended value
☸️ Kubernetes	One Linux node running both control-plane and worker roles.
🐘 PostgreSQL	External PostgreSQL is preferred. Embedded PostgreSQL is acceptable for evaluation and small environments.
🐇 RabbitMQ	Use site-local in-cluster RabbitMQ. The embedded chart dependency is acceptable for evaluation. For production, prefer the RabbitMQ Cluster Operator running a single RabbitMQ node, because the embedded chart relies on the unsupported `bitnamilegacy` image. See the RabbitMQ section for details.
💾 Storage access mode	RWX is not required. Set `storage.sharedVolumeAccessMode: ReadWriteOnce` because every Exivity pod runs on the same node.
💾 Storage class	Use a provisioner-backed local StorageClass. Validated examples include Docker Desktop's `hostpath`, K3s' built-in `local-path`, and `local-path-provisioner`. Do not point Exivity directly at unmanaged raw `hostPath` volumes; always go through a StorageClass/provisioner. NAS/NFS is a valid alternative when you already operate reliable NAS, want to decouple storage from the Kubernetes node, want easier node rebuild/replacement, or anticipate migrating to multi-node later. NAS/NFS does not make Exivity HA when Kubernetes itself is still single-node. Longhorn works on a single node but provides limited HA value there because replicas cannot be spread across nodes.
🚦 Ingress / load balancer	Any Kubernetes ingress controller with TLS termination is supported. Proven options include Traefik, NGINX Ingress Controller, and HAProxy Ingress. Reach the cluster through a `LoadBalancer` service provided by your platform (a cloud provider's native load balancer or an upstream hardware load balancer). On bare-metal Kubernetes without a cloud provider, an implementation such as MetalLB can fill that role; treat it as one option among hardware and software load balancers and confirm operational fit with your platform team.
🔄 Backups	Back up PostgreSQL and shared data. Test the restore path before production handover.
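
For the storage access mode decision above, the values fragment is small. This sketch uses only the `storage.sharedVolumeAccessMode` key named in the table. The keys for selecting the StorageClass and for per-PVC sizes are not named in this document, so they are left out rather than guessed.

```yaml
storage:
  # Every Exivity pod runs on the same node, so ReadWriteOnce is sufficient.
  sharedVolumeAccessMode: ReadWriteOnce
```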

Single-node disk capacity

Local-path style provisioners do not track or reserve aggregate disk capacity across PVCs on the node. The sum of requested PVC sizes can exceed available disk without Kubernetes blocking it. Size all Exivity PVCs against the actual node disk capacity, leave headroom for PostgreSQL, logs, and image growth, and monitor node disk free space.

Reference values: `charts/exivity/examples/best-practice-single-node.yaml`

Scenario B: multi-node Kubernetes, single site

Multi-node Kubernetes is the preferred architecture for production HA environments within one site. This is the default recommendation for larger deployments.

Architecture

This table describes the role of each layer in the diagram above. It is descriptive, not prescriptive.

Layer	Role in this scenario
👥 Users / API clients	Reach Exivity through an external load balancer in front of cluster ingress.
🛡️ Ingress / TLS	Routes traffic to multiple stateless Exivity replicas; terminates TLS.
☸️ Kubernetes worker nodes	Multiple nodes spread across failure domains; host the Exivity application tier and middleware.
🧩 Application tier	Stateless services (frontend, API, backend) run with multiple replicas; workflow and ETL services run as singletons.
🐘 PostgreSQL	External or in-cluster Kubernetes-native PostgreSQL serving the active workload from the same low-latency site.
🐇 RabbitMQ	Site-local in-cluster RabbitMQ used for transient communication.
💾 Shared storage	RWX-capable storage shared across nodes; holds logs, config, and pipeline data.

Configuration

This table lists the choices to make for a multi-node single-site deployment. It is prescriptive.

Decision	Recommended value
☸️ Kubernetes	Use at least three worker nodes. For HA control-plane requirements, also use three control-plane nodes.
📍 Node placement	Spread nodes across racks, chassis, failure domains, or availability zones where available.
🐘 PostgreSQL	Use external PostgreSQL for production. For self-hosted Kubernetes PostgreSQL, use CloudNativePG.
🐇 RabbitMQ	Run RabbitMQ site-local in-cluster. Use the RabbitMQ Cluster Operator for production because the embedded chart dependency relies on the unsupported `bitnamilegacy` image. External or managed RabbitMQ is optional when required by your platform standards. See the RabbitMQ section for details.
💾 Storage access mode	RWX is required because Exivity pods run across multiple nodes. Keep `storage.sharedVolumeAccessMode: ReadWriteMany`.
💾 Storage class	Prefer Longhorn with three replicas per volume. An HA NAS/NFS platform that exposes RWX is a valid alternative when Longhorn or an equivalent CSI RWX storage class is not available. Avoid using a simple in-cluster NFS server (for example, the NFS Ganesha server and external provisioner backed by a single PVC) as the HA default unless its backing storage and node placement are explicitly designed for HA.
🚦 Load balancer	Use a hardware load balancer or your existing load balancing platform in front of ingress. On bare-metal Kubernetes without a cloud-provided load balancer, an implementation such as MetalLB (L2 or BGP mode) can fill that role; confirm operational fit with your platform team before treating it as production-default.
👥 Application replicas	Scale stateless frontend/API/backend services to at least two replicas. Keep workflow and ETL-style services singleton unless Exivity confirms a scaling pattern for your workload.
📆 Scheduling	Use node anti-affinity or topology spread constraints where the platform supports it.
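
The scheduling row above can be illustrated with standard Kubernetes pod spec fields. The fragment below is a generic sketch, not Exivity chart syntax: this document does not state how the chart exposes these fields, and the label key and value are hypothetical and must match the labels your pods actually carry.

```yaml
# Fragment of a pod template spec (spec.template.spec of a Deployment).
topologySpreadConstraints:
  - maxSkew: 1                           # allow at most one pod of imbalance
    topologyKey: kubernetes.io/hostname  # spread across nodes
    whenUnsatisfiable: ScheduleAnyway    # prefer spreading, never block scheduling
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: glass   # hypothetical label
affinity:
  podAntiAffinity:
    # Soft rule: avoid placing two matching pods on the same node.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: glass   # hypothetical label
```

Soft constraints are used here so that a node outage does not leave replicas unschedulable. Switch to `DoNotSchedule` or a required anti-affinity rule only if strict separation matters more than availability of the replica count.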

Service replica guidance

The following is a conservative starting point. Scale after observing CPU, memory, queue depth, and report preparation behavior.

Service	Starting replicas	Notes
`glass`	2	Stateless UI.
`proximityApi`	2	Stateless API; scale horizontally behind ingress.
`edify`, `horizon`, `pigeon`, `transcript`, `use`	2	Pull work from RabbitMQ queues (`REPORT`, `BUDGET`, `PIGEON`/`WORKFLOW_EVENT`/`REPORT_PUBLISHED`, `TRANSFORM`, and `EXTRACT` respectively). RabbitMQ delivers each queued job to one consumer, so multiple replicas distribute load and increase throughput.
`chronos`, `executor`, `griffon`, `proximityCli`	1	Must remain singletons. These services own scheduling, workflow dispatch, and CLI execution, where multiple replicas would duplicate work.

Avoid running the same job twice concurrently

RabbitMQ ensures each queued message is delivered once, but it does not stop you from queueing the same logical task (the same extractor, transformer, or report for the same period) twice. Running the same task concurrently can produce overlapping writes to `extracted`, `exported`, or `report`, regardless of how many replicas a service has. This is a workflow-design concern, not a replica-count concern: design schedules and triggers so the same task for the same period is not enqueued in parallel.

Reference values: `charts/exivity/examples/best-practice-multi-node.yaml`

Scenario C: multi-site active/passive

For deployments spanning multiple physical sites, Exivity recommends active/passive. The active site runs the application and middleware. The passive site continuously receives replicated data and is promoted during a failover event.

Active/passive avoids the operational complexity of active/active PostgreSQL writes, RabbitMQ stretching, Longhorn stretching, and workflow execution conflicts.

Architecture

This table describes the role of each layer in the diagram above, comparing the active and passive sites side by side. It is descriptive, not prescriptive.

Layer	Active site	Passive site
🚦 Traffic routing	DNS, GSLB, or load balancer sends users to the active ingress.	Standby ingress is prepared for failover but does not receive normal traffic.
🧩 Application tier	Exivity service replicas are greater than `0`.	Exivity service replicas remain `0` until failover.
🐘 PostgreSQL	Runs the primary database endpoint.	Receives replicated data or restores from validated backups before promotion.
🐇 RabbitMQ	Runs an independent site-local RabbitMQ instance.	Runs a separate site-local RabbitMQ instance; RabbitMQ state is not replicated.
💾 Storage	Uses site-local shared storage and backup replication.	Uses independent site-local storage restored or attached during failover.
☸️ GitOps	Controls scaling, routing, and failover changes through versioned state.	Promotes the site through the same repeatable workflow.

Configuration

This table lists the choices to make for a multi-site active/passive deployment. It is prescriptive.

Decision	Recommended value
🧩 Application	Run Exivity only in the active site. Keep passive-site application replicas at `0` until failover.
🐘 PostgreSQL	Use active/passive PostgreSQL replication. For CloudNativePG, use a replica cluster or a supported backup/restore promotion pattern.
🐇 RabbitMQ	Do not stretch RabbitMQ across sites. Deploy one site-local RabbitMQ instance per site. RabbitMQ state is not replicated between sites.
💾 Longhorn / storage	Do not stretch a Longhorn cluster across sites. Use independent Longhorn clusters per site and replicate data through backups or storage-layer replication supported by your platform.
🚦 DNS / load balancing	Use DNS, GSLB, or your load balancing platform to route users to the active site.
☸️ Failover control	Use GitOps for repeatable failover. Argo CD with Argo Workflows or Argo Events is the preferred implementation pattern.

Required GitOps failover pattern

A multi-site deployment must have a version-controlled, tested failover workflow. The workflow should perform the following actions in order:

1. Mark Site A unavailable and stop routing new traffic to it.
2. Scale Site A Exivity application replicas to 0 if the cluster is reachable.
3. Promote the Site B PostgreSQL replica or restore the latest validated backup, depending on the PostgreSQL design.
4. Ensure Site B RabbitMQ is available and configured for Exivity.
5. Restore or attach the required Site B shared data volumes.
6. Scale Site B Exivity application replicas above 0.
7. Switch DNS, GSLB, or load balancer traffic to Site B.
8. Run application validation checks before handing the service back to users.
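
The ordered actions above can be sketched as an Argo Workflows `Workflow` whose steps run strictly in sequence. This is a skeleton under stated assumptions: the workflow name, step names, and the `placeholder` template are hypothetical, and each step only echoes its intended action. Replace each placeholder with the real automation for your DNS/GSLB, PostgreSQL, storage, and Git repositories.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exivity-failover-    # hypothetical name
spec:
  entrypoint: failover
  templates:
    - name: failover
      # Each outer list item is one sequential step. A step starts only
      # after the previous step has finished successfully.
      steps:
        - - name: stop-routing-site-a
            template: placeholder
            arguments:
              parameters: [{name: action, value: "stop routing new traffic to Site A"}]
        - - name: scale-down-site-a
            template: placeholder
            # Site A may be unreachable during a real outage, so do not
            # abort the failover if this step fails.
            continueOn:
              failed: true
            arguments:
              parameters: [{name: action, value: "scale Site A replicas to 0"}]
        - - name: promote-postgresql-site-b
            template: placeholder
            arguments:
              parameters: [{name: action, value: "promote Site B PostgreSQL or restore backup"}]
        - - name: check-rabbitmq-site-b
            template: placeholder
            arguments:
              parameters: [{name: action, value: "verify Site B RabbitMQ"}]
        - - name: attach-storage-site-b
            template: placeholder
            arguments:
              parameters: [{name: action, value: "restore or attach Site B shared volumes"}]
        - - name: scale-up-site-b
            template: placeholder
            arguments:
              parameters: [{name: action, value: "scale Site B replicas above 0"}]
        - - name: switch-traffic
            template: placeholder
            arguments:
              parameters: [{name: action, value: "switch DNS, GSLB, or load balancer to Site B"}]
        - - name: validate
            template: placeholder
            arguments:
              parameters: [{name: action, value: "run application validation checks"}]

    # Hypothetical stand-in for real automation: prints the intended action.
    - name: placeholder
      inputs:
        parameters:
          - name: action
      container:
        image: alpine:3.20
        command: [sh, -c]
        args: ["echo 'TODO: {{inputs.parameters.action}}'"]
```

Keeping the sequence in one versioned workflow makes the order reviewable and lets you rehearse failover with the same definition you would use in an incident.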

If you do not yet have GitOps practices, implement the same sequence as a documented and tested runbook, and treat that as an interim measure rather than the preferred operating model. Best-practice multi-site deployments require GitOps, because it reduces failover risk and makes the process repeatable.

Reference values: `charts/exivity/examples/best-practice-multi-site-active-passive.yaml`

Active/active across sites

Active/active across sites is discouraged and should not be used as the default architecture.

The main concerns are:

Concern	Impact
🐘 PostgreSQL write conflicts	Bidirectional PostgreSQL replication is complex and can introduce conflict handling requirements that Exivity does not need in active/passive mode.
📆 Workflow scheduling	Only one site should execute workflows unless there is a clear leader-election or workload partitioning design. Otherwise, work may be duplicated or events may not progress as expected.
🐇 RabbitMQ stretching	RabbitMQ clusters should not be stretched across high-latency links for this use case.
💾 Storage stretching	Longhorn should not be stretched across sites. Site-local storage is simpler and safer.
⏱️ Latency	WAN latency to PostgreSQL can significantly affect report preparation and other database-heavy operations.

If you need active/active, treat it as a custom architecture and involve Exivity engineering before committing to the design.

Middleware recommendations

The middleware products in this section are third-party dependencies. Exivity requires compatible database, message queue, storage, networking, backup, and monitoring services, but the operation and support of those services remains your responsibility or that of your chosen platform/vendor.

PostgreSQL

PostgreSQL is the most important stateful dependency. Production deployments should use external PostgreSQL rather than the embedded Bitnami dependency shipped with the chart.

Recommended options:

Option	Recommendation
🐘 Managed or standard PostgreSQL	Preferred where you already operate a supported HA PostgreSQL platform.
☸️ CloudNativePG	Recommended for self-hosted PostgreSQL on Kubernetes. See the CloudNativePG documentation.
🐘 Embedded Bitnami PostgreSQL	Acceptable for evaluation and small single-node deployments only. Not recommended for production HA.

Starting recommendations:

Setting	Recommendation
💾 Storage	Start at 25-50Gi. Monitor and expand before reaching 70% utilization.
🔁 Replication	Use active/passive HA within a site or across sites.
🔄 Backups	Use PostgreSQL-native backups. For CloudNativePG, use Barman Cloud to S3-compatible object storage where available.
🛡️ TLS	Use TLS for database traffic where supported by your platform.
🌐 Latency	Keep Exivity and PostgreSQL in the same low-latency site for active workloads. WAN latency around 15ms or higher can materially affect report preparation.
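
For the self-hosted option, a minimal CloudNativePG `Cluster` resource might look like the sketch below. The resource name, namespace, and instance count are assumptions chosen for illustration. Backups, TLS, and resource limits are deliberately omitted and should follow the CloudNativePG documentation for your operator version.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: exivity-postgres      # hypothetical name
  namespace: exivity-db       # hypothetical namespace
spec:
  # One primary plus two standbys managed by the operator (active/passive HA).
  instances: 3
  storage:
    # Upper end of the starting range recommended above.
    size: 50Gi
```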

RabbitMQ

Exivity uses RabbitMQ for transient application communication and work coordination, not as the primary system of record. Data integrity is primarily tied to PostgreSQL and shared data volumes. For this reason, a site-local in-cluster RabbitMQ deployment is the default recommendation. If RabbitMQ fails, Kubernetes can reschedule it, and interrupted work can be retried without introducing an external middleware dependency. External or managed RabbitMQ is optional when required by your platform standards.

Bitnami RabbitMQ chart end-of-life

The Exivity Helm chart's embedded RabbitMQ dependency is based on the Bitnami RabbitMQ Helm chart. Following the Bitnami container catalog changes on September 29, 2025, the chart consumes the unsupported `bitnamilegacy/rabbitmq` image through the Exivity-hosted Bitnami mirror. That image will not receive updates or security patches, so this arrangement is a temporary compatibility measure only.

Treat the embedded RabbitMQ as suitable only for evaluation and small single-node deployments. For production, keep RabbitMQ site-local but run it outside the Exivity chart, preferably with the RabbitMQ Cluster Operator.

How to run site-local RabbitMQ:

Implementation	When to use	Notes
🐇 RabbitMQ Cluster Operator	Production default for site-local RabbitMQ	Maintained upstream by the RabbitMQ team; uses official `rabbitmq` images; declarative `RabbitmqCluster` CRD; supported queue types and policies.
🐇 RabbitMQ Messaging Topology Operator	Optional alongside the Cluster Operator	Lets you manage vhosts, users, queues, exchanges, bindings, and policies as Kubernetes resources. See the Messaging Topology Operator overview.
🐇 Embedded chart dependency	Evaluation and small single-node only	Based on the Bitnami chart and `bitnamilegacy` image; do not treat as a long-term production architecture.
🐇 Managed RabbitMQ	Optional, when required by platform standards	Not site-local; only choose this when you already operate a managed RabbitMQ platform. Connect Exivity through the external `rabbitmq.host`, `rabbitmq.port`, `rabbitmq.vhost`, and `rabbitmq.secure` values.
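
As an illustration of the production default, a single-node `RabbitmqCluster` resource for the RabbitMQ Cluster Operator could be as small as the sketch below. The resource name, namespace, and storage size are assumptions. It requires the Cluster Operator to be installed in the cluster first.

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: exivity-rabbitmq      # hypothetical name
  namespace: exivity
spec:
  # Single RabbitMQ node: clustering stays disabled, matching the
  # default recommendation in this section.
  replicas: 1
  persistence:
    # Persistent storage for production; the size is illustrative.
    storage: 10Gi
```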

Recommended options by scenario:

Scenario	Recommendation
🐇 Single-node	Use site-local in-cluster RabbitMQ. The embedded chart is acceptable for evaluation; prefer the RabbitMQ Cluster Operator (single node) for long-term deployments.
🐇 Multi-node single-site	Use site-local in-cluster RabbitMQ via the RabbitMQ Cluster Operator. External or managed RabbitMQ is optional when required by your standards.
🐇 Multi-site	Run one independent site-local RabbitMQ deployment per site. Do not stretch RabbitMQ across sites and do not replicate RabbitMQ state between sites. The RabbitMQ Cluster Operator is the preferred way to run each site-local deployment.

Starting recommendations:

Setting	Recommendation
🐇 Clustering	Keep clustering disabled by default. Only enable clustering for a dedicated multi-node RabbitMQ design.
🐇 Queues	Prefer quorum queues for new RabbitMQ designs where compatible with your RabbitMQ version and policy model.
💾 Persistence	Use persistent storage for production RabbitMQ.
📈 Monitoring	Monitor queue depth, memory, disk free space, and connection count.

Confirm final RabbitMQ values against your chosen RabbitMQ deployment method before applying production tuning.
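
When RabbitMQ runs outside the Exivity chart, the connection is described through the `rabbitmq.host`, `rabbitmq.port`, `rabbitmq.vhost`, and `rabbitmq.secure` values named above. The sketch below assumes those keys nest in the usual Helm way. The host name is hypothetical, and the port and `secure` settings assume plain AMQP inside the cluster. The value that disables the embedded dependency, and the credential values, are not named in this document and are therefore not shown.

```yaml
rabbitmq:
  host: exivity-rabbitmq.exivity.svc   # hypothetical in-cluster Service DNS name
  port: 5672                           # default AMQP port; 5671 is conventional for AMQP over TLS
  vhost: /
  secure: false                        # assumption: enable together with a TLS listener
```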

Longhorn

Longhorn is the preferred storage provider for HA Exivity Kubernetes deployments when it is available in your environment. It is considered mature enough for production use and is generally more resilient than a standard in-cluster NFS provisioner in HA environments.

Recommended options:

Scenario	Recommendation
💾 Single-node	RWX is not required. Use a provisioner-backed local StorageClass (Docker Desktop `hostpath`, K3s `local-path`, or `local-path-provisioner`) with `storage.sharedVolumeAccessMode: ReadWriteOnce`. NAS/NFS is a valid alternative when you already operate reliable NAS or want storage decoupled from the node, but does not by itself make Exivity HA when Kubernetes is single-node. Longhorn is possible but provides limited HA value on one node because replicas cannot be spread across nodes.
💾 Multi-node single-site	Prefer Longhorn with three replicas per volume.
💾 Multi-site	Use one independent Longhorn deployment per site. Do not stretch Longhorn across sites.

Starting recommendations:

Setting	Recommendation
💾 Replicas	Configure three replicas per volume for HA environments.
☸️ Replica placement	Spread replicas across nodes and failure domains where possible.
🔄 Backups	Configure recurring snapshots and recurring backups to an S3-compatible or otherwise approved backup target.
📊 Capacity	Size disks for usable capacity after three-way replication and snapshot overhead.
💾 RWX	Validate RWX behavior before production, including share-manager scheduling and failover.

Confirm final Longhorn StorageClass and recurring job values against the current chart before applying production tuning.
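
A Longhorn StorageClass with three replicas per volume can be sketched as follows. The class name is hypothetical and the parameter values are starting points, not validated Exivity settings.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-exivity      # hypothetical name
provisioner: driver.longhorn.io
allowVolumeExpansion: true    # lets PVCs be resized as data grows
parameters:
  numberOfReplicas: "3"       # three replicas per volume for HA
  staleReplicaTimeout: "2880" # minutes before a failed replica is cleaned up
  fsType: ext4
```

With three replicas, each GiB of volume data consumes about three GiB of raw disk across the cluster before snapshot overhead, which is why disks must be sized for usable rather than raw capacity.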

NFS

When NFS is used as RWX storage for Exivity, deploy the NFS Ganesha server and external provisioner (the nfs-server-provisioner Helm chart) rather than an arbitrary NFS server. It serves NFSv4 with file locking, which Exivity requires, and it is the reference NFS provisioner used in this documentation.

This in-cluster provisioner can work well for smaller or simpler deployments. For HA environments, prefer an external HA NAS platform that exposes NFSv4, because the in-cluster provisioner becomes a single point of failure unless its backing storage and node placement are explicitly designed for HA.
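
Whichever NFS platform you choose, a small throwaway claim is a quick way to confirm that the StorageClass actually provisions RWX volumes before installing Exivity. The claim name and StorageClass name below are hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-smoke-test        # hypothetical name; delete after the check
  namespace: exivity
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs       # hypothetical; use your RWX StorageClass
  resources:
    requests:
      storage: 1Gi
```

If the claim binds, mount it from pods on two different nodes and write from both to confirm shared access and file locking behave as expected.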

Operations checklist

Before production

Check	Requirement
💾 Storage	Storage class validated with Exivity PVCs: RWX for multi-node deployments, RWO is sufficient for single-node.
📶 PVC sizes	Production PVC sizes set explicitly.
🐘 PostgreSQL	HA design, backups, restore, and monitoring validated.
🐇 RabbitMQ	Connectivity, authentication, TLS, and monitoring validated.
🚦 Ingress / load balancer	DNS, TLS certificate, ingress class, and trusted proxy behavior validated.
🔐 Secrets	Production `secret.appKey`, `secret.jwtSecret`, PostgreSQL password, and RabbitMQ password configured.
🔄 Backups	Restore test completed.
📈 Monitoring	Cluster, application, PostgreSQL, RabbitMQ, ingress, and storage alerts configured.

Day-2 operations

Area	Recommendation
⎈ Upgrades	Run Helm upgrades from version-controlled values. Back up PostgreSQL before upgrades.
📊 Capacity	Monitor PostgreSQL, `extracted`, `exported`, and log volume growth.
📄 Logs	Lower log retention first; expand log PVCs only if retention requirements still exceed capacity.
🔁 DR	Test failover regularly for multi-site deployments.
🔐 Security	Rotate credentials according to your security policy and keep images patched.

Example values files

Use these example files as starting points, not as final production values:

Scenario	File
Single-node	`charts/exivity/examples/best-practice-single-node.yaml`
Multi-node	`charts/exivity/examples/best-practice-multi-node.yaml`
Multi-site active/passive	`charts/exivity/examples/best-practice-multi-site-active-passive.yaml`
